import requests
payload = {
'api_key': 'API_KEY',
'query': 'iphone 15 charger',
's': 'price-asc-rank'
}
response = requests.get('https://api.scraperapi.com/structured/amazon/search',
params=payload).json()Data Prep / EDA
Where the data source, processing, and visualization (EDA) is presented.
Data Collection
Amazon product information was scraped from the website using the API service ScraperAPI; this is because, as Amazon is a hugely popular website, they have many anti-scraping measures in place such as rate-limiting, IP blocking, dymamic loading, and such. Using the external API service, these limitations were able to be avoided. The search queries chosen to search for items were based on top 100 Amazon searches, found on this site and this site. An example of using the API, along with its core endpoint, is below.
The jupyter notebook code for the web scraping can be found here.
Additionally, more data was used to supplement the existing data. Since the scraped data was only about 26K rows, a Kaggle dataset was used that contains more than one million rows, had around the same fields as the scraped data, and was also from the USA (many Amazon Kaggle datasets were from the non-US).
The raw data from both sources can be seen below in Table 1; the scraped raw data CSV can also be viewed here.
| type | position | asin | name | image | has_prime | is_best_seller | is_amazon_choice | is_limited_deal | stars | total_reviews | url | availability_quantity | spec | price_string | price_symbol | price | original_price | section_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | search_product | 7 | B0BTBN4VP1 | Evaporative Portable Air Conditioners, 700ML P... | https://m.media-amazon.com/images/I/71AHsuMgEZ... | False | False | False | False | 2.6 | 112.0 | https://www.amazon.com/Evaporative-Portable-Co... | NaN | {} | NaN | NaN | NaN | NaN | NaN |
| 1 | search_product | 43 | B09YSZ26TY | Charmin Ultra Strong Toilet Paper 12 Mega Roll... | https://m.media-amazon.com/images/I/8189tH7KLX... | False | False | False | False | 4.8 | 372.0 | https://www.amazon.com/Charmin-Ultra-Strong-To... | NaN | {} | NaN | NaN | NaN | NaN | NaN |
| 2 | search_product | 35 | B09Y8NK2F3 | BirkenstockArizona Waxy Narrow Leather Sandals | https://m.media-amazon.com/images/I/71eaJXKHjJ... | False | False | False | False | 5.0 | 3.0 | https://www.amazon.com/Birkenstock-Arizona-Nar... | NaN | {} | $276.95 | $ | 276.95 | NaN | NaN |
| 3 | search_product | 40 | B0CV9H8J4C | PRETYZOOM 3pcs Shoe Storage Rack Foldable Stor... | https://m.media-amazon.com/images/I/51yydWjKNs... | False | False | False | False | NaN | NaN | https://www.amazon.com/dp/B0CV9H8J4C/ref=sr_1_... | NaN | {} | $15.78 | $ | 15.78 | NaN | NaN |
| 4 | search_product | 6 | B09N6PLBDX | 2pcs Car Interior Center Console Decoration Fr... | https://m.media-amazon.com/images/I/61iZE2SlFi... | False | False | False | False | NaN | NaN | https://www.amazon.com/TINKI-Interior-Decorati... | NaN | {} | $630.00 | $ | 630.00 | NaN | NaN |
| asin | title | imgUrl | productURL | stars | reviews | price | listPrice | category_id | isBestSeller | boughtInLastMonth | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B0B63CFNPY | JINTUM 414A Toner Cartridge (with CHIP) Compat... | https://m.media-amazon.com/images/I/71CGA50Aes... | https://www.amazon.com/dp/B0B63CFNPY | 4.2 | 572 | 139.99 | 149.99 | 72 | False | 0 |
| 1 | B08FJCGXKR | 4 Pieces Words Christmas Clear Stamps Set Sili... | https://m.media-amazon.com/images/I/81UlHQvTTJ... | https://www.amazon.com/dp/B08FJCGXKR | 4.5 | 0 | 10.99 | 0.00 | 5 | False | 50 |
| 2 | B07NPXM3G4 | 5 Pair of Half Insoles - Shoe Filler, Half-Siz... | https://m.media-amazon.com/images/I/81Q2IsZYdw... | https://www.amazon.com/dp/B07NPXM3G4 | 3.9 | 0 | 9.95 | 0.00 | 131 | False | 400 |
| 3 | B0C7CGPGZC | Disney Cars Mini Racers 15-Pack Variety Radiat... | https://m.media-amazon.com/images/I/81ND3IJWyC... | https://www.amazon.com/dp/B0C7CGPGZC | 5.0 | 0 | 64.95 | 0.00 | 236 | False | 0 |
| 4 | B07SH21322 | Crinkle Cut Paper Shred Filler Royal Blue & Go... | https://m.media-amazon.com/images/I/81ORI0gpx3... | https://www.amazon.com/dp/B07SH21322 | 4.1 | 0 | 10.95 | 0.00 | 12 | False | 50 |
Data Cleaning
The datasets were cleaned seperately, then concatenated, then some final steps were taken to clean it.
The steps to clean the web-scaped data were:
- Add
date_scrapedcolumn - Remove unecessary columns:
type,position,has_prime,is_amazon_choice,is_limited_deal,availability_quantity,spec,price_string,price_symbol,section_name - Expand and fix
original_price - Rename columns to match standard snake case for merging both datasets
- Drop rows with no asin or name or price
- Drop rows with price of 0.0, since that doesn’t make sense
- Fill NaN
reviewscolumn with 0
The steps to clean the Kaggle data were:
- Add
date_scrapedcolumn - Remove unecessary columns
boughtInLastMonth - Drop rows with any NaNs
- Fix
list_priceof $0 to be instead equal toprice - Change
category_idto actual category by usingcategorytable - Drop rows with price of $0, since that doesn’t make sense
- Rename columns to match standard snake case for merging both datasets
And then, after they were concatenated, the steps to clean were:
- Remove duplicates (by asin + date scraped)
- Rename columns
The final cleaned (and concatenated) dataset can be seen in Table 2 (with the original raw data in Table 1):
| Asin | Name | Image Url | Is Best Seller | Stars | Reviews | Url | Price | Date Scraped | List Price | Category | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B0C6TLWQ2V | Pllieay 5 in 1 Plastic Cross-Stitch Hoops, Pla... | https://m.media-amazon.com/images/I/51T8EHasQR... | False | 4.3 | 0.0 | https://www.amazon.com/dp/B0C6TLWQ2V | 20.99 | 2023-11-01 | 21.99 | Needlework Supplies |
| 1 | B08GL3CVLF | Toddlers and Baby Boys' Swimsuit Trunk and Ras... | https://m.media-amazon.com/images/I/81XD7QCQb7... | True | 4.8 | 0.0 | https://www.amazon.com/dp/B08GL3CVLF | 23.80 | 2023-11-01 | 23.80 | Baby Boys' Clothing & Shoes |
| 2 | B0BMG7TTLV | Hi Vis Rain Jacket, Class 3 High Visibility Ra... | https://m.media-amazon.com/images/I/61NmeMU7vk... | False | 4.2 | 0.0 | https://www.amazon.com/dp/B0BMG7TTLV | 70.99 | 2023-11-01 | 70.99 | Safety & Security |
| 3 | B0CFQW79K8 | Tea Infuser Strainers for Loose Tea, Stainless... | https://m.media-amazon.com/images/I/71vpVzzzbg... | False | 4.7 | 0.0 | https://www.amazon.com/dp/B0CFQW79K8 | 7.99 | 2023-11-01 | 7.99 | Kitchen & Dining |
| 4 | B01M25PXTZ | Younik Vertical Stand for PS-4 Slim, Built-in ... | https://m.media-amazon.com/images/I/71i8zRx4vv... | False | 4.6 | 0.0 | https://www.amazon.com/dp/B01M25PXTZ | 11.99 | 2023-11-01 | 11.99 | Legacy Systems |
The code for the data cleaning can be found here.
Data Preprocessing / Visualization
Various types of EDA were performed in order to examine the data; as a note, most visuals are interactive (zoomable, pannable, etc). The code for all visualizations can be found here.
If the interactive figures don’t load, dont worry: just turn off all ad-blockers/privacy browsing, make sure you are using Chrome/Firefox, and refresh the page until all figures load.